Abstract



Combining information from various image features has 
become a standard technique in concept recognition tasks. 
However, the optimal way of fusing the resulting kernel 
functions is usually unknown in practical applications. 
Multiple kernel learning (MKL) techniques make it possible to
determine an optimal linear combination of such similarity
matrices. Classical approaches to MKL promote sparse 
mixtures. Unfortunately, so-called 1-norm MKL variants 
are often observed to be outperformed by an unweighted 
sum kernel. The contribution of this paper is twofold:
first, we apply a recently developed non-sparse MKL variant
to state-of-the-art concept recognition tasks within computer
vision. Second, we provide insights on the benefits and limits
of non-sparse MKL and compare it against its direct competitors,
the sum-kernel SVM and sparse MKL. We
report empirical results for the PASCAL VOC 2009 Classification
and ImageCLEF2010 Photo Annotation challenge
data sets. About to be submitted to PLoS ONE.

1 Introduction

A common strategy in visual object recognition tasks is 
to combine different image representations to capture rel- 
evant traits of an image. Prominent representations are for 
instance built from color, texture, and shape information 
and used to accurately locate and classify the objects of 
interest. The importance of such image features changes 
across the tasks. For example, color information increases 
the detection rates of stop signs in images substantially 
but it is almost useless for finding cars. This is because
stop signs are red in most countries, whereas cars can in
principle have any color. As additional but nonessential
features not only slow down computation but
may even harm predictive performance, it is necessary to
combine only relevant features for state-of-the-art object
recognition systems.

We will approach visual object classification from a 
machine learning perspective. In the last decades, support 
vector machines (SVM) [1, 2, 3] have been successfully 
applied to many practical problems in various fields in- 
cluding computer vision [4]. Support vector machines ex- 
ploit similarities of the data, arising from some (possibly 
nonlinear) measure. The matrix of pairwise similarities,
also known as the kernel matrix, allows one to abstract the data
from the learning algorithm [5, 6].

That is, given a task at hand, the practitioner needs to 
find an appropriate similarity measure and to plug the re- 
sulting kernel into an appropriate learning algorithm. But 
what if this similarity measure is difficult to find? We note 
that [7] and [8] were the first to exploit prior and domain 
knowledge for the kernel construction. 

In object recognition, translating information from various
image descriptors into several kernels has now become
a standard technique. Consequently, the problem of
finding the right kernel turns into that of finding an appropriate
way of fusing the kernel information; however, finding
the right combination for a particular application is so far
often a matter of judicious choice (or trial and error).

In the absence of principled approaches, practitioners 
frequently resort to heuristics such as uniform mixtures 
of normalized kernels [9, 10] that have proven to work 
well. Nevertheless, this may lead to sub-optimal kernel 
mixtures. 

An alternative approach is multiple kernel learning
(MKL), which has been applied to object classification tasks
involving various image descriptors [11, 12]. Multiple
kernel learning [13, 14, 15, 16] generalizes the support
vector machine framework and aims at learning the optimal
kernel mixture and the model parameters of the SVM
simultaneously. To obtain a well-defined optimization
problem, many MKL approaches promote sparse mixtures
by incorporating a 1-norm constraint on the mixing
coefficients. Compared to heuristic approaches, MKL has
the appealing property of learning a kernel combination
(w.r.t. the $\ell_1$-norm constraint) and converges quickly as it
can be wrapped around a regular support vector machine
[15]. However, some evidence shows that sparse kernel
mixtures are often outperformed by an unweighted-sum
kernel [17]. As a remedy, [18, 19] propose $\ell_2$-norm
regularized MKL variants, which promote non-sparse kernel
mixtures and have subsequently been extended to $\ell_p$-norms
[20, 21].

Multiple kernel approaches have been applied to various
computer vision problems outside our scope, such as
multi-class problems [22], which require mutually exclusive
labels, and object detection [23, 24] in the sense of
finding object regions in an image. The latter reaches its
limits when image concepts can no longer be represented by an
object region, such as the Outdoor, Overall Quality
or Boring concepts in the ImageCLEF2010 dataset,
which we will use.

In this contribution, we study the benefits of sparse
and non-sparse MKL in object recognition tasks. We
report on empirical results on image data sets from the
PASCAL visual object classes (VOC) 2009 [25] and ImageCLEF2010
Photo Annotation [26] challenges, showing
that non-sparse MKL significantly outperforms the uniform
mixture and $\ell_1$-norm MKL. Furthermore, we discuss
the reasons for performance gains and performance limitations
obtained by MKL based on additional experiments
using real-world and synthetic data.

The family of MKL algorithms is not restricted to 
SVM-based ones. Another competitor, for example, is 
Multiple Kernel Learning based on Kernel Discriminant 
Analysis (KDA) [27, 28]. The difference between MKL- 
SVM and MKL-KDA lies in the underlying single kernel 
optimization criterion while the regularization over kernel 
weights is the same. 

Outside the MKL family, but within our problem
scope of image classification and ranking, lies, for example,
[29], which uses logistic regression as the base criterion
and results in a number of optimization parameters
equal to the number of samples times the number of input
features. Since the approach in [29] uses a priori many
more optimization variables, it poses a more challenging
and potentially more time-consuming optimization problem,
which limits the number of applicable features; it
can be evaluated in detail for our medium-scale datasets
in the future.

Alternatives use more general combinations of kernels,
such as products with kernel widths as weighting parameters
[30, 31]. As [31] point out, the corresponding optimization
problems are no longer convex. Consequently,
they may find suboptimal solutions, and it is more difficult
to assess with such methods how much gain can be
achieved via learning of kernel weights.

This paper is organized as follows. In Section 2, we
briefly review the machine learning techniques used here;
in Section 3 we present our experimental results
on the VOC2009 and ImageCLEF2010 datasets; in
Section 4 we discuss promoting and limiting factors of
MKL and the sum-kernel SVM in three learning scenarios.



2 Methods 

This section briefly introduces multiple kernel learning
(MKL) and kernel target alignment. For more details we
refer to the supplement and the works cited therein.

2.1 Multiple Kernel Learning

Given a finite number of different kernels, each of which
implies the existence of a feature mapping $\psi_j : X \to H_j$
onto a Hilbert space,

$$k_j(x, \bar{x}) = \langle \psi_j(x), \psi_j(\bar{x}) \rangle_{H_j},$$

the goal of multiple kernel learning is to learn SVM parameters
$(\alpha, b)$ and linear kernel weights $K = \sum_l \beta_l k_l$
simultaneously.

This can be cast as the following optimization problem,
which extends support vector machines [2, 6]:

$$\text{s.t.} \quad \forall_{i=1}^{n} : 0 \le \alpha_i \le C; \quad \sum_{i=1}^{n} y_i \alpha_i = 0;$$

$$\forall_{l=1}^{m} : \beta_l \ge 0; \quad \|\beta\|_p \le 1.$$

For details on the solution of this optimization problem 
and its kernelization we refer to the supplement and [21]. 
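
For orientation, the objective that usually accompanies constraints of this kind can be written in the min-max form common in the $\ell_p$-norm MKL literature (cf. [21]). The following is a sketch under that assumption, not a transcription of the paper's own equation; the index names ($i, i'$ for the $n$ examples, $l$ for the $m$ kernels) are chosen here for readability:

```latex
\min_{\beta}\; \max_{\alpha}\;
  \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i,i'=1}^{n} \alpha_i \alpha_{i'} y_i y_{i'}
    \sum_{l=1}^{m} \beta_l\, k_l(x_i, x_{i'})
\quad \text{s.t.} \quad
  0 \le \alpha_i \le C,\;\;
  \sum_{i=1}^{n} y_i \alpha_i = 0,\;\;
  \beta_l \ge 0,\;\;
  \|\beta\|_p \le 1 .
```

For fixed $\beta$, the inner maximization is an ordinary SVM dual with the combined kernel $K = \sum_l \beta_l k_l$, which is why MKL can be wrapped around a regular SVM solver.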

While prior work on MKL imposes a 1-norm constraint
on the mixing coefficients to enforce sparse solutions lying
on a standard simplex [14, 15, 32, 33], we employ a
generalized $\ell_p$-norm constraint $\|\beta\|_p \le 1$ for $p \ge 1$, as
used in [20, 21]. The implications of this modification
in the context of image concept classification will be discussed
throughout this paper.



2.2 Kernel Target Alignment 

The kernel alignment introduced by [34] measures the
similarity of two matrices as the cosine of the angle between
them, viewed as vectors under the Frobenius inner product.

It was argued in [35] that centering is required in order
to correctly reflect the test errors from SVMs via kernel
alignment. Centering in the corresponding feature spaces
[36] can be achieved by taking the product HKH, with

$$H := I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top, \qquad (3)$$

where $I$ is the identity matrix of size n and $\mathbf{1}$ is the column vector
with all ones. The centered kernel which achieves a
perfect separation of two classes is proportional to $y y^\top$,
where

$$y_i = \begin{cases} \frac{1}{n_+} & \text{if example } i \text{ is positive,} \\ -\frac{1}{n_-} & \text{if example } i \text{ is negative,} \end{cases} \qquad (4)$$

and $n_+$ and $n_-$ are the sizes of the positive and negative
classes, respectively.
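
The centered kernel-target alignment described in this section can be sketched in a few lines. The code below is a minimal illustration in plain Python, assuming the standard alignment $A(K_1, K_2) = \langle K_1, K_2 \rangle_F / (\|K_1\|_F \|K_2\|_F)$ and a target vector with entries $1/n_+$ and $-1/n_-$; the function names are hypothetical and not taken from the authors' code.

```python
import math

def center(K):
    """Return HKH with H = I - (1/n) 1 1^T (centering in feature space)."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - col_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]

def frobenius(A, B):
    """Frobenius inner product of two equally sized matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def alignment(A, B):
    """Cosine of the angle between A and B under the Frobenius product."""
    return frobenius(A, B) / math.sqrt(frobenius(A, A) * frobenius(B, B))

def kernel_target_alignment(K, labels):
    """Alignment between the centered kernel and the ideal kernel y y^T."""
    n_pos = sum(1 for label in labels if label > 0)
    n_neg = len(labels) - n_pos
    y = [1.0 / n_pos if label > 0 else -1.0 / n_neg for label in labels]
    target = [[yi * yj for yj in y] for yi in y]
    return alignment(center(K), target)

# Toy example (hypothetical data): a block kernel that separates the classes.
labels = [1, 1, -1, -1]
K_ideal = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
kta_ideal = kernel_target_alignment(K_ideal, labels)
```

A kernel that separates the two classes perfectly reaches an alignment of one, whereas an uninformative kernel such as the identity matrix scores strictly lower.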

3 Empirical Evaluation 

In this section, we evaluate $\ell_p$-norm MKL in real-world
image categorization tasks, experimenting on the
VOC2009 and ImageCLEF2010 data sets. We also provide
insights on when and why $\ell_p$-norm MKL can help
performance in image classification applications. The
evaluation measure for both datasets is the average precision
(AP) over all recall values based on the precision-recall
(PR) curves.
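
To make the evaluation measure concrete, the following sketch computes a non-interpolated average precision from classifier scores. It is an illustration only: the official challenge evaluation code may differ in details such as interpolation or tie handling, and the function name is an assumption.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of the precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Toy ranking (hypothetical scores): positives at ranks 1 and 3.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

In this toy ranking the two positives are retrieved with precisions 1/1 and 2/3, so the AP is their mean, 5/6.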

3.1 Data Sets 

We experiment on the following data sets: 

1. PASCAL2 VOC Challenge 2009. We use the official
data set of the PASCAL2 Visual Object Classes Challenge
2009 (VOC2009) [25], which consists of 13979 images.
We use the official split into 3473 training, 3581 validation,
and 6925 test examples as provided by the challenge
organizers. The organizers also provided annotations of
the 20 object categories; note that an image can have
multiple object annotations. The task is to solve 20 binary
classification problems, i.e. predicting whether at
least one object from a class k is visible in the test image.
Although the test labels are undisclosed, the more
recent VOC datasets permit evaluating AP scores on the
test set via the challenge website (the number of allowed
submissions per week being limited).

2. ImageCLEF 2010 PhotoAnnotation. The ImageCLEF2010
PhotoAnnotation data set [26] consists of
8000 labeled training images taken from Flickr and a test
set with undisclosed labels. The images are annotated
with 93 highly variable concept classes: they include
both well-defined objects such as lake, river,
plants, trees, and flowers, and many rather ambiguously
defined concepts such as winter, boring, architecture,
macro, artificial, and motion blur. The latter concepts
might not always be connected to objects present
in an image or captured by a bounding box, which makes
the task highly challenging for any recognition system.
Unfortunately, there is currently no official way to obtain
test set performance scores from the challenge organizers.
Therefore, for this data set, we report on training
set cross-validation performances only. As for VOC2009,
we decompose the problem into 93 binary classification
problems. Again, many concept classes are challenging
to rank or classify by an object detection approach due
to their inherent non-object nature. As for the previous
dataset, each image can be labeled with multiple concepts.

3.2 Image Features and Base Kernels 

In all of our experiments we deploy 32 kernels capturing 
various aspects of the images. The kernels are inspired by 
the VOC 2007 winner [37] and our own experiences from 
our submissions to the VOC2009 and ImageCLEF2009 
challenges. We can summarize the employed kernels by
the following four types of basic features:

• Histogram over a bag of visual words over SIFT fea- 
tures (BoW-S), 15 kernels 

• Histogram over a bag of visual words over color in- 
tensity histograms (BoW-C), 8 kernels 

• Histogram of oriented gradients (HoG), 4 kernels 

• Histogram of pixel color intensities (HoC), 5 kernels. 

We used a higher fraction of bag-of-words-based features
as we knew from our challenge submissions that
they perform better than global histogram features.
The intention was, however, to use a variety of different
feature types that have proven to be effective
on the above datasets in the past, while at the same time
obeying memory limitations of at most 25 GB per job
as required by the computer facilities used in our experiments
(we used a cluster of 23 nodes having in total 256 AMD64
CPUs and with memory limits ranging from 32 to 96 GB
RAM per node).

The above features are derived from histograms that
contain no spatial information. We therefore enrich the respective
representations by using the spatial tilings 1×1, 3×1,
2×2, 4×4, and 8×8, which correspond to single levels
of the pyramidal approach [9] (this is for capturing the
spatial context of an image). Furthermore, we apply a $\chi^2$
kernel on top of the enriched histogram features, which
is an established kernel for capturing histogram features
[10]. The bandwidth of the $\chi^2$ kernel is heuristically
chosen as the mean $\chi^2$ distance over all pairs of
training examples [38].
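
The bandwidth heuristic can be illustrated as follows. This is a minimal sketch in plain Python that assumes the common form $d(x, y) = \sum_i (x_i - y_i)^2 / (x_i + y_i)$ for the $\chi^2$ distance and $k(x, y) = \exp(-d(x, y)/\sigma)$ for the kernel; constant factors vary between implementations, and the function names are hypothetical. It also assumes at least two histograms that are not all identical, so that the mean distance is positive.

```python
import math

def chi2_distance(x, y):
    """Chi-square distance between two non-negative histograms."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)

def chi2_kernel_matrix(histograms):
    """Chi-square kernel with the bandwidth set to the mean pairwise distance."""
    n = len(histograms)
    D = [[chi2_distance(histograms[i], histograms[j]) for j in range(n)]
         for i in range(n)]
    # Mean over all distinct pairs of training examples.
    pair_distances = [D[i][j] for i in range(n) for j in range(i + 1, n)]
    sigma = sum(pair_distances) / len(pair_distances)
    K = [[math.exp(-D[i][j] / sigma) for j in range(n)] for i in range(n)]
    return K, sigma

# Toy histograms (hypothetical data).
K, sigma = chi2_kernel_matrix([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

Tying the bandwidth to the mean distance keeps the kernel values on a comparable scale across features with very different histogram sizes, which avoids a separate bandwidth search per kernel.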

The BoW features were constructed in a standard way
[39]: first, the SIFT descriptors [40] were calculated on
a regular grid with a pitch of 6 pixels for each image; then
a codebook of size 4000 for the SIFT features and of size
900 for the color histograms was learned by k-means clustering
(with a random initialization). Finally, all SIFT descriptors
were assigned to visual words (so-called prototypes) and
then summarized into histograms within entire images or
sub-regions. We computed the SIFT features over the
following color combinations, which are inspired by the
winners of the PASCAL VOC 2008 challenge from
the University of Amsterdam [41]: red-green-blue (RGB),
normalized RGB, gray-opponentColor1-opponentColor2,
and gray-normalized opponentColor1-opponentColor2;
in addition, we also use a simple gray channel.
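
The assignment step of the BoW pipeline can be sketched as below. The example assumes hard assignment to the nearest codebook entry under Euclidean distance and an L1-normalized histogram; both are common choices but are assumptions here, as are the function names and the tiny two-word codebook.

```python
def nearest_word(descriptor, codebook):
    """Index of the codebook prototype closest to the descriptor (Euclidean)."""
    def squared_distance(word):
        return sum((d - c) ** 2 for d, c in zip(descriptor, word))
    return min(range(len(codebook)), key=lambda k: squared_distance(codebook[k]))

def bow_histogram(descriptors, codebook):
    """Histogram of visual-word counts, normalized to sum to one."""
    histogram = [0.0] * len(codebook)
    for descriptor in descriptors:
        histogram[nearest_word(descriptor, codebook)] += 1.0
    total = sum(histogram)
    return [h / total for h in histogram] if total else histogram

# Hypothetical 2-D descriptors and a two-word codebook.
codebook = [[0.0, 0.0], [10.0, 10.0]]
histogram = bow_histogram([[1.0, 1.0], [9.0, 9.0], [0.0, 2.0]], codebook)
```

Applying the same function to the descriptors of each cell of a spatial tiling and concatenating the results gives the tiled representation on which the kernels are computed.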

We computed the 15-dimensional local color histograms
over the color combinations red-green-blue,
gray-opponentColor1-opponentColor2, gray, and hue (the
latter being weighted by the pixel value of the value component
in the HSV color representation).

This means that, for BoW-S, we considered five color channels
with three spatial tilings each (1×1, 3×1, and 2×2),
resulting in 15 kernels; for BoW-C, we considered four
color channels with two spatial tilings each (1×1 and
3×1), resulting in 8 kernels.

The HoG features were computed by discretizing the
orientation of the gradient vector at each pixel into 24
bins and then summarizing the discretized orientations
into histograms within image regions [42]. Canny detectors
[43] are used to discard contributions from pixels
around which the image is almost uniform. We computed
them over the color combinations red-green-blue,
gray-opponentColor1-opponentColor2, and gray, thereby
using the two spatial tilings 4×4 and 8×8. For the experiments
we used four kernels: a product kernel created
from the two kernels with the red-green-blue color combination
but using different spatial tilings, another product
kernel created in the same way but using the
gray-opponentColor1-opponentColor2 color combination, and
the two kernels using the gray channel alone (but differing
in their spatial tiling).

The HoC features were constructed by discretizing
pixel-wise color values and computing their 15-bin
histograms within image regions. To this end,
we used the color combinations red-green-blue,
gray-opponentColor1-opponentColor2, and gray. For each
color combination the spatial tilings 2×2, 3×1, and 4×4
were tried. In the experiments we deploy five kernels: a
product kernel created from the three kernels with different
spatial tilings with colors red-green-blue, a product
kernel created from the three kernels with color combination
gray-opponentColor1-opponentColor2, and the three
kernels using the gray channel alone (differing in their spatial
tiling).

Note that building a product kernel out of $\chi^2$ kernels
boils down to concatenating feature blocks (but using a
separate kernel width for each feature block). The intention
here was to use single kernels at separate spatial
tilings for the weaker features (for problems depending
on a certain tiling resolution) and combined kernels with
all spatial tilings merged into one kernel to keep the memory
requirements low and let the algorithms select the best
choice.
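
The equivalence between a product of $\chi^2$ kernels and a single kernel on concatenated blocks can be checked numerically. The sketch below assumes kernels of the form $\exp(-d_b / \sigma_b)$ with $d_b(x, y) = \sum_i (x_i - y_i)^2 / (x_i + y_i)$; because this distance scales linearly with the histogram, dividing each block by its width before concatenation reproduces the product. All names and numbers are hypothetical.

```python
import math

def chi2_distance(x, y):
    """Chi-square distance between two non-negative histograms."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)

def product_kernel(blocks_x, blocks_y, widths):
    """Product of per-block chi-square kernels exp(-d_b / width_b)."""
    value = 1.0
    for bx, by, width in zip(blocks_x, blocks_y, widths):
        value *= math.exp(-chi2_distance(bx, by) / width)
    return value

def concatenated_kernel(blocks_x, blocks_y, widths):
    """One chi-square kernel on the concatenation of width-rescaled blocks."""
    cat_x = [v / width for bx, width in zip(blocks_x, widths) for v in bx]
    cat_y = [v / width for by, width in zip(blocks_y, widths) for v in by]
    return math.exp(-chi2_distance(cat_x, cat_y))

# Hypothetical histograms of one image pair for two spatial tilings.
x_blocks = [[0.2, 0.8], [0.5, 0.3, 0.2]]
y_blocks = [[0.6, 0.4], [0.1, 0.1, 0.8]]
widths = [0.7, 1.3]
k_product = product_kernel(x_blocks, y_blocks, widths)
k_concat = concatenated_kernel(x_blocks, y_blocks, widths)
```

Both values agree up to floating-point error, which is why a product kernel stores one matrix instead of one per tiling.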

In practice, the normalization of kernels is as important
for MKL as the normalization of features is for training
regularized linear or single-kernel models. This is owed
to the bias introduced by the regularization: optimal feature /
kernel weights are requested to be small, implying
a bias towards excessively up-scaled kernels. In general,
there are several ways of normalizing kernel functions.
We apply the following normalization method, proposed
in [44, 45] and entitled multiplicative normalization
in [21]; on the feature-space level this normalization corresponds
to rescaling training examples to unit variance,

$$K \leftarrow \frac{K}{\frac{1}{n}\operatorname{tr}(K) - \frac{1}{n^2}\mathbf{1}^\top K \mathbf{1}}. \qquad (5)$$
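
The multiplicative normalization of Eq. (5) divides the kernel matrix by $\frac{1}{n}\operatorname{tr}(K) - \frac{1}{n^2}\mathbf{1}^\top K \mathbf{1}$, the variance of the training examples in feature space. A minimal sketch in plain Python follows; the function names and the toy matrix are illustrative assumptions.

```python
def feature_space_variance(K):
    """(1/n) tr(K) - (1/n^2) 1^T K 1 for a square kernel matrix K."""
    n = len(K)
    trace_term = sum(K[i][i] for i in range(n)) / n
    mean_term = sum(sum(row) for row in K) / n ** 2
    return trace_term - mean_term

def multiplicative_normalization(K):
    """Rescale K so that the training examples have unit variance."""
    scale = feature_space_variance(K)
    return [[value / scale for value in row] for row in K]

# Hypothetical 2x2 kernel matrix and an up-scaled copy of it.
K = [[4.0, 1.0], [1.0, 9.0]]
K_scaled = [[10.0 * value for value in row] for row in K]
K_norm = multiplicative_normalization(K)
K_scaled_norm = multiplicative_normalization(K_scaled)
```

Both matrices map to the same normalized kernel, so an arbitrarily up-scaled kernel gains no advantage under the norm constraint on the weights.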

3.3 Experimental Setup 

We treat the multi-label data set as binary classification 
problems, that is, for each object category we trained 
a one-vs.-rest classifier. Multiple labels per image ren- 
der multi-class methods inapplicable as these require mu- 
tually exclusive labels for the images. The respective 
SVMs are trained using the Shogun toolbox [46]. In or- 
der to shed light on the nature of the presented techniques 
from a statistical viewpoint, we first pooled all labeled 
data and then created 20 random cross-validation splits 
for VOC2009 and 12 splits for the larger dataset Image- 
CLEF2010. 

For each of the 12 or 20 splits, the training images were
used for learning the classifiers, while the SVM/MKL regularization
parameter C and the norm parameter p were
chosen based on the maximal AP score on the validation
images. Thereby, the regularization constant C is optimized
by class-wise grid search over
$C \in \{10^i \mid i = -1, -0.5, 0, 0.5, 1\}$.
Preliminary runs indicated that this
way the optimal solutions are attained inside the grid.
Note that for $p = \infty$ the $\ell_p$-norm MKL boils down to a
simple SVM using a uniform kernel combination (subsequently
called sum-kernel SVM). In our experiments, we
used the average-kernel SVM instead of the sum-kernel
one. This is no limitation, as both lead to identical
results for an appropriate choice of the SVM regularization
parameter.

For a rigorous evaluation, we would have to construct
a separate codebook for each cross-validation split. However,
creating codebooks and assigning descriptors to visual
words is a time-consuming process. Therefore, in
our experiments we resort to the common practice of using
a single codebook created from all training images
contained in the official split. Although this could result
in a slight overestimation of the AP scores, this affects
all methods equally and does not favor any classification
method over another. Our focus lies on a relative
comparison of the different classification methods; therefore,
there is no loss in exploiting this computational shortcut.

3.4 Results 

In this section we report on the empirical results achieved
by $\ell_p$-norm MKL in our visual object recognition experiments.

VOC 2009. Table 2 shows the AP scores attained on
the official test split of the VOC2009 data set (scores
obtained by evaluation via the challenge website). The
class-wise optimal regularization constant has been selected
by cross-validation-based model selection on the
training data set. We can observe that non-sparse MKL
outperforms the baselines $\ell_1$-MKL and the sum-kernel
SVM in this sound evaluation setup. We also report on
the cross-validation performance achieved on the training
data set (Table 1). Comparing the two results, one can observe
a small overestimation for the cross-validation approach
(for the reasons argued in Section 3.3); however,
the amount by which this happens is equal for all methods.
In particular, the ranking of the compared methods
(SVM versus $\ell_p$-norm MKL for various values of p) is
preserved for the average over all classes and for most of
the classes (exceptions are the bottle and bird classes); this
shows the reliability of the cross-validation-based evaluation
method in practice. Note that the observed variance
in the AP measure across concepts can be explained
in part by the variations in the label distributions across
concepts and cross-validation splits. Unlike for the AUC
measure, the average score of the AP measure under randomly
ranked images depends on the ratio of positive and
negative labeled samples.

A reason why the bottle class shows such a strong de- 
viation towards sparse methods could be the varying but 
often small fraction of image area covered by bottles lead- 
ing to overfitting when using spatial tilings. 

We can also remark that $\ell_{1.333}$-norm MKL achieves the best
result of all compared methods on the VOC dataset,
closely followed by $\ell_{1.125}$-norm MKL. To evaluate the
statistical significance of our findings, we perform a
Wilcoxon signed-rank test for the cross-validation-based
results (see Table 1; significant results are marked in boldface).
We find that in 15 out of the 20 classes the optimal
result is achieved by truly non-sparse $\ell_p$-norm MKL
(which means $p \in\, ]1, \infty[$), thus outperforming the baseline
significantly.

ImageCLEF. Table 4 shows the AP scores averaged
over all classes achieved on the ImageCLEF2010 data set.
We observe that the best result is achieved by the non-sparse
$\ell_p$-norm MKL algorithms with norm parameters
p = 1.125 and p = 1.333. The detailed results for all
93 classes are shown in the supplemental material (see
Tables B.1 and B.2). We can see from the detailed results that in
37 out of the 93 classes the optimal result attained by non-sparse
$\ell_p$-norm MKL was significantly better than the sum
kernel according to a Wilcoxon signed-rank test.

We also show the results for optimizing the norm parameter
p class-wise (see Table 5). We can see from the
table that optimizing the $\ell_p$-norm class-wise is beneficial:
selecting the best $p \in\, ]1, \infty[$ class-wise, the result is increased
to an AP of 39.70. Also including $\ell_1$-norm MKL
in the candidate set, the performance can even be raised
to 39.82; this is 0.7 AP better than the result for the
vanilla sum-kernel SVM. Also including the latter in the
set of models, the AP score increases by merely 0.03
AP points. We conclude that optimizing the norm parameter
p class-wise can improve performance; however, one
can rely on $\ell_p$-norm MKL alone without the need to additionally
include the sum-kernel SVM in the set of models.
Tables 1 and 2 show that the gain in performance for MKL
varies considerably with the actual concept class. Notice
that these observations are confirmed by the results presented
in Tables B.1 and B.2; see the supplemental material
for details.
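
Class-wise selection of the norm parameter amounts to picking, for each concept, the candidate p with the highest validation AP. The sketch below illustrates this with made-up validation scores; the class names, the numbers, and the function name are hypothetical and are not results from the paper.

```python
import math

def select_norm_per_class(validation_ap, candidates):
    """For each class, return the candidate p with the highest validation AP."""
    return {
        concept: max(candidates, key=lambda p: scores[p])
        for concept, scores in validation_ap.items()
    }

# Hypothetical validation AP per class and norm (math.inf = sum-kernel SVM).
validation_ap = {
    "concept_a": {1.0: 0.30, 1.333: 0.50, math.inf: 0.40},
    "concept_b": {1.0: 0.60, 1.333: 0.50, math.inf: 0.55},
}
non_sparse_only = select_norm_per_class(validation_ap, [1.333, math.inf])
all_norms = select_norm_per_class(validation_ap, [1.0, 1.333, math.inf])
```

Enlarging the candidate set can only keep or raise the validation score of each class, which matches the monotone increase across the columns of Table 5; whether the gain carries over to unseen data is a separate question.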

3.5 Analysis and Interpretation 

We now analyze the kernel set in an explorative manner;
to this end, our methodological tools are the following:

1. Pairwise kernel alignment scores (KA)

2. Centered kernel-target alignment scores (KTA).

3.5.1 Analysis of the Chosen Kernel Set 

To start with, we computed the pairwise kernel alignment
scores of the 32 base kernels: they are shown in Fig. 1.
We recall that the kernels can be classified into the following
groups: Kernels 1-15 and 16-23 employ BoW-S and
BoW-C features, respectively; Kernels 24 to 27 are product
kernels associated with the HoG and HoC features;
Kernels 28-30 deploy HoC, and, finally, Kernels 31-32
are based on HoG features over the gray channel. We see
from the block-diagonal structure that features that are of
the same type (but are generated for different parameter
values, color channels, or spatial tilings) are strongly correlated.
Furthermore, the BoW-S kernels (Kernels 1-15)
are weakly correlated with the BoW-C kernels (Kernels
16-23). Both the BoW-S and HoG kernels (Kernels 24-25,
31-32) use gradients and therefore are moderately correlated;
the same holds for the BoW-C and HoC kernel
groups (Kernels 26-30). This corresponds to our original
intention to have a broad range of feature types which are,
however, useful for the task at hand. The principal usefulness
of our feature set can be seen a posteriori from the
fact that $\ell_1$-MKL achieves the worst performance of all
methods included in the comparison while the sum-kernel
SVM performs moderately well. Clearly, a higher fraction
of noise kernels would further harm the sum-kernel SVM
and favor the sparse MKL instead (we investigate the impact
of noise kernels on the performance of $\ell_p$-norm MKL
in an experiment on controlled, artificial data; this is presented
in the supplemental material).

Table 1: Average AP scores on the VOC2009 data set with AP scores computed by cross-validation on the training set. Boldface
marks the best method and all others that are not statistically significantly worse.

Table 2: AP scores attained on the VOC2009 test data, obtained on request from the challenge organizers. Best methods are
marked boldface.

$\ell_p$-norm   ...     ...     ...     ...     ...     ...     ...
1.125           ...     ...     ...     ...     46.36   63.10   60.89
1.333           58.00   53.87   43.14   48.17   46.54   63.08   61.28
2               57.98   53.47   40.95   48.07   46.59   63.02   60.91
∞               57.30   53.07   39.74   47.27   45.87   62.49   60.55

$\ell_p$-norm   person  pottedplant  sheep   sofa    train   tvmonitor
1               81.73   31.57        36.68   45.72   80.52   61.41
1.125           82.65   34.61        41.91   46.59   80.13   63.51
1.333           82.72   34.60        44.14   46.42   79.93   63.60
2               82.52   33.40        44.81   45.98   79.53   63.26
∞               82.20   32.76        44.15   45.69   79.03   63.00

Table 3: Average AP scores on the VOC2009 data set with norm parameter p class-wise optimized over AP scores on the training
set. We report on test set scores obtained on request from the challenge organizers.

∞       {1, ∞}   {1.125, 1.333, 2}   {1.125, 1.333, 2, ∞}   {1, 1.125, 1.333, 2}   all norms from the left
55.85   55.94    56.75               56.76                  56.75                  56.76

Table 4: Average AP scores obtained on the ImageCLEF2010 data set with p fixed over the classes and AP scores computed by
cross-validation on the training set.

$\ell_p$-norm   1              1.125          1.333          2              ∞
                37.32 ± 5.87   39.51 ± 6.67   39.48 ± 6.66   39.13 ± 6.62   39.11 ± 6.68

Based on the observation that the BoW-S kernel subset
shows high KTA scores, we also evaluated the performance
restricted to the 15 BoW-S kernels only. Unsurprisingly,
this setup favors the sum-kernel SVM, which
achieves higher results on VOC2009 for most classes;
compared to $\ell_p$-norm MKL using all 32 kernels, the sum-kernel
SVM restricted to 15 kernels achieves slightly better
AP scores for 11 classes, but also slightly worse for
9 classes. Furthermore, the sum-kernel SVM, $\ell_2$-MKL,
and $\ell_{1.333}$-MKL were on par, with differences well below
0.01 AP. This is again not surprising as the kernels
from the BoW-S kernel set are strongly correlated with
each other for the VOC data, which can be seen in the
top left image in Fig. 1. For the ImageCLEF data we observed
a quite different picture: the sum-kernel SVM restricted
to the 15 BoW-S kernels performed significantly
worse when, again, compared to non-sparse $\ell_p$-norm
MKL using all 32 kernels. To achieve top state-of-the-art
performance, one could optimize the scores for
both datasets by considering the class-wise maxima over
learning methods and kernel sets. However, since the intention
here is not to win a challenge but to provide a relative
comparison of models that gives insights into the nature of the
methods, we discard the time-consuming optimization
over the kernel subsets.

From the above analysis, the question arises why restricting
the kernel set to the 15 BoW-S kernels affects
the performance of the compared methods differently
for the VOC2009 and ImageCLEF2010 data sets. This
can be explained by comparing the KA/KTA scores of
the kernels attained on VOC and on ImageCLEF (see
Fig. 1 (right)): for the ImageCLEF data set the KTA
scores are substantially more spread across all kernels;
there is neither a dominance of the BoW-S subset in the
KTA scores nor a particularly strong correlation within
the BoW-S subset in the KA scores. We attribute this
to the less object-based and more ambiguous nature of
many of the concepts contained in the ImageCLEF data
set. Furthermore, the KA scores for the ImageCLEF data
(see Fig. 1 (left)) show that this dataset exhibits a higher
variance among kernels; this is because the correlations
between all kinds of kernels are weaker for the ImageCLEF
data.

Table 5: Average AP scores obtained on the ImageCLEF2010 data set with norm parameter p class-wise optimized and AP scores
computed by cross-validation on the training set.

∞              {1, ∞}         {1.125, 1.333, 2}   {1.125, 1.333, 2, ∞}   {1, 1.125, 1.333, 2}   all norms from the left
39.11 ± 6.68   39.33 ± 6.71   39.70 ± 6.80        39.74 ± 6.85           39.82 ± 6.82           39.85 ± 6.88




Figure 1: Similarity of the kernels for the VOC2009 (top) and
ImageCLEF2010 (bottom) data sets in terms of pairwise kernel
alignments (left) and kernel target alignments (right),
respectively. Axes show the kernel index. In both data sets, five groups can be identified:
'BoW-S' (Kernels 1-15), 'BoW-C' (Kernels 16-23), 'products
of HoG and HoC kernels' (Kernels 24-27), 'HoC single' (Kernels
28-30), and 'HoG single' (Kernels 31-32).

Therefore, because of this non-uniformity in the spread
of the information content among the kernels, we can
conclude that our experimental setting indeed falls into
the situation where non-sparse MKL can outperform the
baseline procedures (again, see the supplemental material).
For example, the BoW features are more informative than
HoG and HoC, and thus the uniform sum-kernel SVM
is suboptimal. On the other hand, because
typical image features are only moderately informative,
HoG and HoC still convey a certain amount of complementary
information; this is what allows the performance
gains reported in Tables 1 and 4.

Note that we class-wise normalized the KTA scores to 
sum to one. This is because we are rather interested in a 
comparison of the relative contributions of the particular 
kernels than in their absolute information content, which 
anyway can be more precisely derived from the AP scores 
already reported in Tables 1 and 4. Furthermore, note that 
we consider centered KA and KTA scores, since it was ar- 
gued in [35] that only those correctly reflect the test errors 
attained by established learners such as SVMs. 

3.5.2 The Role of the Choice of ℓ_p-norm

Next, we turn to the interpretation of the norm parameter
p in our algorithm. We observe a big gap in performance
between ℓ_1.125-norm MKL and the sparse ℓ_1-norm MKL.
The reason is that for p > 1 MKL is reluctant to set
kernel weights to zero, as can be seen from Figure 2. In
contrast, ℓ_1-norm MKL eliminates 62.5% of the kernels from
the working set. The difference between the ℓ_p-norms for
p > 1 lies solely in the ratio by which the less informative
kernels are down-weighted; they are never assigned
true zeros.

However, as proved in [21], in the computational optimum
the MKL algorithm determines the kernel weights from the
information content of the particular kernels, as given by
an ℓ_p-norm-dependent formula (see Eq. (8); this will be
discussed in detail in Section 4.1). We mention at this
point that the kernel weights all converge to the same,
uniform value for p → ∞. We can confirm these theoretical
findings empirically: the histograms of the kernel weights
shown in Fig. 2 clearly indicate an increasing uniformity
in the distribution of kernel weights when letting p → ∞.
Higher values of p thus cause the weight distribution to
shift away from zero and become slanted to the right,
while smaller ones tend to increase its skewness to
the left.
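
The flattening of the weights for growing p can be illustrated with a short sketch. It assumes the closed-form weight update known from the ℓ_p-norm MKL literature, θ_m = ||w_m||^{2/(p+1)} / (Σ_m' ||w_m'||^{2p/(p+1)})^{1/p}, and it keeps the block norms ||w_m|| fixed, which is a simplification: in the actual algorithm the norms themselves change from iteration to iteration, and this is also how exact zeros arise for p = 1. The function name and the example norms are hypothetical.

```python
def lp_kernel_weights(w_norms, p):
    """Kernel weights for fixed block norms ||w_m||, scaled to unit l_p-norm."""
    powered = [w ** (2.0 / (p + 1.0)) for w in w_norms]
    norm = sum(t ** p for t in powered) ** (1.0 / p)
    return [t / norm for t in powered]

# Illustrative block norms of three kernels (made-up values).
w_norms = [4.0, 1.0, 0.25]
for p in (1.0, 1.125, 1.333, 2.0, 1000.0):
    theta = lp_kernel_weights(w_norms, p)
    # The ratio of the largest to the smallest weight shrinks towards 1 as p grows.
    print(p, max(theta) / min(theta))
```

For fixed norms the ratio between two weights is (||w_a|| / ||w_b||)^{2/(p+1)}, so the exponent vanishes for p → ∞ and all weights coincide, which matches the uniform limit stated above.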

Selection of the ℓ_p-norm permits tuning the strength of
the regularization of the learning of kernel weights. In this
sense the sum-kernel SVM clearly is an extreme, namely
fixing the kernel weights, obtained when letting p → ∞.
The sparse MKL marks another extreme case: ℓ_p-norms
with p below 1 lose the convexity property, so that p = 1
is the maximally sparse choice that still preserves
convexity. Sparsity can be interpreted here as selecting
only a few kernels, namely those considered most
informative according to the optimization objective. Thus,
the ℓ_p-norm acts as a prior parameter for how much we
trust in the informativeness of a kernel. In conclusion,
this interpretation justifies the usage of ℓ_p-norms outside
the existing choices ℓ_1 and ℓ_2. The fact that the sum-kernel
SVM is a reasonable choice in the context of image
annotation will be discussed further in Section 4.1.

Figure 2: Histogram of kernel weights as output by ℓ_p-norm
MKL for the various classes on the VOC2009 data set
(32 kernels × 20 classes, resulting in 640 values):
ℓ_1-norm (top left), ℓ_1.125-norm (top right), ℓ_1.333-norm
(bottom left), and ℓ_2-norm (bottom right).

Our empirical findings on ImageCLEF and VOC seem
to contradict previous ones about the usefulness of MKL
reported in the literature, where ℓ_1-norm MKL is frequently
observed to be outperformed by a simple sum-kernel SVM
(for example, see [30]); however, in these studies the
sum-kernel SVM is compared to ℓ_1-norm or ℓ_2-norm MKL
only. In fact, our results confirm these findings: ℓ_1-norm
MKL is outperformed by the sum-kernel SVM in all of our
experiments. Nevertheless, in this paper, we show that by
using the more general ℓ_p-norm regularization, the
prediction accuracy of MKL can be considerably improved,
even clearly outperforming the sum-kernel SVM, which
has been shown to be a tough competitor in the past [12].
Of course, the simpler sum-kernel SVM also has its
advantage, although on the computational side only: in
our experiments it was about a factor of ten faster than its
MKL competitors. Further information about runtimes of
MKL algorithms compared to sum-kernel SVMs can be
taken from [47].

3.5.3 Remarks for Particular Concepts 

Finally, we show images from classes where MKL helps
performance and discuss relationships to kernel weights.
We have seen above that the sparsity-inducing ℓ_1-norm
MKL clearly outperforms all other methods on the bottle
class (see Table 2). Fig. 3 shows two typical highly ranked
images and the corresponding kernel weights as output
by ℓ_1-norm (left) and ℓ_1.333-norm MKL (right),
respectively, on the bottle class. We observe that ℓ_1-norm
MKL tends to rank party and people group scenes highly.
We conjecture that this has two reasons: first, many
people group and party scenes come along with co-occurring
bottles. Second, people group scenes have gradient
distributions similar to those of images of large, upright
standing bottles, sharing many dominant vertical lines and
a thinner head section; see the left- and right-hand images
in Fig. 3. Sparse ℓ_1-norm MKL strongly focuses on the
dominant HoG product kernel, which is able to capture
the aforementioned special gradient distributions, giving
small weights to two HoC product kernels and almost
completely discarding all other kernels.

Next, we turn to the cow class, for which we have seen
above that ℓ_1.333-norm MKL clearly outperforms all other
methods. Fig. 4 shows a typical highly ranked image
of that class and also the corresponding kernel weights
as output by ℓ_1-norm (left) and ℓ_1.333-norm (right)
MKL, respectively. We observe that ℓ_1-MKL focuses on
the two HoC product kernels; this is justified by typical
cow images having green grass in the background. This
allows the HoC kernels to easily distinguish the cow
images from the indoor and vehicle classes such as car
or sofa. However, horse and sheep images have such a
green background, too. They differ in sheep usually being
black-white, and horses having a brown-black color bias
(in VOC data); cows have rather variable colors. Here,
we observe that the rather complex yet somewhat
color-based BoW-C and BoW-S features help performance; it
is also those kernels that are selected by the non-sparse
ℓ_1.333-MKL, which is the best performing model on those
classes. In contrast, the sum-kernel SVM suffers from
including the five gray-channel-based features, which are
hardly useful for the horse and sheep classes and mostly
introduce additional noise. All MKL variants succeed
in identifying those kernels and assign them
low weights.

4 Discussion 

In the previous section we presented empirical evidence
that ℓ_p-norm MKL can considerably help performance
in visual image categorization tasks. We also observed
that the gain is class-specific and limited for some classes
when compared to the sum-kernel SVM; see again Tables
1 and 2 as well as Tables B.1 and B.2 in the supplemental
material. In this section, we aim to shed light on the
reasons for this behavior, in particular discussing strengths
of the average kernel in Section 4.1, trade-off effects in
Section 4.2, and strengths of MKL in Section 4.3. Since
these scenarios are based on statistical properties of
kernels which can be observed in concept recognition tasks
within computer vision, we expect the results to be
transferable to other algorithms which learn linear models
over kernels, such as [28, 29].

Figure 3: Typical highly ranked bottle images and
kernel weights from ℓ_1-MKL (left) and ℓ_1.333-MKL (right);
the horizontal axis of the weight plots is the kernel index.

Figure 4: A typical highly ranked cow image and
kernel weights from ℓ_1-MKL (left) and ℓ_1.333-MKL (right);
the horizontal axis of the weight plots is the kernel index.

4.1 One Argument For the Sum Kernel: 
Randomness in Feature Extraction 

We would like to draw attention to one aspect present in 
BoW features, namely the amount of randomness induced 
by the visual word generation stage acting as noise with 
respect to kernel selection procedures. 

Experimental setup We consider the following experiment,
similar to the one undertaken in [30]: we compute
a BoW kernel ten times, each time using the same local
features, identical spatial pyramid tilings, and identical
kernel functions; the only difference between subsequent
repetitions of the experiment lies in the randomness
involved in the generation of the codebook of visual words.
Note that we use SIFT features over the gray channel that
are densely sampled over a grid of step size six, 512
visual words (for computational feasibility of the
clustering), and a χ² kernel. This procedure results in ten
kernels that only differ in the randomness stemming from the
codebook generation. We then compare the performance
of the sum-kernel SVM built from the ten kernels to that
of the best single-kernel SVM determined by
cross-validation-based model selection.
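
The construction of the sum kernel from repeated BoW kernels can be sketched as follows. The sketch assumes one common exponential form of the χ² kernel, exp(-d(x, z) / sigma) with d(x, z) = Σ_i (x_i - z_i)² / (x_i + z_i), and an unweighted average as the sum kernel; the exact kernel definition, the width sigma, and the function names are assumptions made for illustration only.

```python
import math

def chi2_kernel(x, z, sigma=1.0):
    """Exponential chi-square kernel between two non-negative histograms."""
    d = sum((a - b) ** 2 / (a + b) for a, b in zip(x, z) if a + b > 0)
    return math.exp(-d / sigma)

def gram(X, sigma=1.0):
    """Kernel matrix of a list of BoW histograms."""
    return [[chi2_kernel(a, b, sigma) for b in X] for a in X]

def sum_kernel(kernels):
    """Unweighted average of equally sized kernel matrices."""
    m, n = len(kernels), len(kernels[0])
    return [[sum(K[i][j] for K in kernels) / m for j in range(n)]
            for i in range(n)]
```

In the experiment above, each of the ten kernel matrices would be computed from histograms over a different random codebook and then passed to `sum_kernel`; averaging instead of summing only rescales the kernel.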

In contrast to [30] we try two codebook generation
procedures, which differ by their intrinsic amount of
randomness: first, we deploy k-means clustering, with random
initialization of the centers and a bootstrap-like selection
of the best initialization (similar to the option 'cluster'
in MATLAB's k-means routine). Second, we deploy
extremely randomized clustering forests (ERCF) [48, 49],
that is, ensembles of randomized trees; the latter
procedure involves a considerably higher amount of
randomization compared to k-means.

Results The results are shown in Table 6. For both
clustering procedures, we observe that the sum-kernel
SVM outperforms the best single-kernel SVM. In
particular, this confirms earlier findings of [30] carried out
for k-means-based clustering. We also observe that the
difference between the sum-kernel SVM and the best
single-kernel SVM is much more pronounced for
ERCF-based kernels; we conclude that this stems from the higher
amount of randomness involved in the ERCF clustering
method when compared to conventional k-means. The
standard deviations of the kernels in Table 6 confirm this
conclusion. For each class we computed the conditional
standard deviation

\[ \operatorname{std}(K \mid y_i = y_j) + \operatorname{std}(K \mid y_i \neq y_j) \qquad (6) \]

averaged over all classes. The usage of a conditional
variance estimator is justified because the ideal similarity in
kernel target alignment (cf. equation (4)) does have a
variance over the kernel as a whole; however, the conditional
deviations in equation (6) would be zero for the ideal
kernel. Similarly, the fundamental MKL optimization
formula (8) relies on a statistic based on the two conditional
kernels used in formula (6). Finally, ERCF clustering
uses label information. Therefore, averaging the
class-wise conditional standard deviations over all classes is not
expected to be identical to the standard deviation of the
whole kernel.
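
The statistic in equation (6) can be sketched for a single class as follows. The sketch assumes that all index pairs (i, j), including the diagonal, enter the two conditional sets and that the population standard deviation is used; both choices, as well as the function names, are assumptions made for illustration.

```python
import math

def std(values):
    """Population standard deviation; zero for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def conditional_std(K, y):
    """std(K | y_i = y_j) + std(K | y_i != y_j) for one binary labeling y."""
    n = len(y)
    same = [K[i][j] for i in range(n) for j in range(n) if y[i] == y[j]]
    diff = [K[i][j] for i in range(n) for j in range(n) if y[i] != y[j]]
    return std(same) + std(diff)

# The ideal kernel y y^T is constant within each conditional set.
y = [1, 1, -1, -1]
ideal = [[a * b for b in y] for a in y]
print(conditional_std(ideal, y))
print(std([v for row in ideal for v in row]))
```

For the ideal kernel the conditional statistic vanishes, while the standard deviation over the whole kernel does not, which is the reason given above for using the conditional estimator.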

We observe in Table 6 that the standard deviations are 
lower for the sum kernels. Comparing ERCF and k-means 
shows that the former not only exhibits larger absolute 
standard deviations but also greater differences between 
single-best and sum-kernel as well as larger differences in 
AP scores. 

We can thus postulate that the reason for the superior 
performance of the sum-kernel SVM stems from aver- 
aging out the randomness contained in the BoW kernels 
(stemming from the visual-word generation). This can be 
explained by the fact that averaging is a way of reduc- 
ing the variance in the predictors/models [50]. We can 
also remark that such variance reduction effects can also 
be observed when averaging BoW kernels with varying 
color combinations or other parameters; this stems from 
the randomness induced by the visual word generation. 

Note that in the above experimental setup each kernel 
uses the same information provided via the local features. 
Consequently, the best we can do is averaging — learning 
kernel weights in such a scenario is likely to suffer from 
overfitting to the noise contained in the kernels and can 
only decrease performance. 
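
The variance reduction argument can be illustrated with a small simulation. It is a toy sketch under strong assumptions: each "kernel" is reduced to a vector of true similarities plus independent Gaussian noise, ten noisy copies are averaged, and the mean squared error to the noise-free values is compared; all names and numbers are illustrative and unrelated to the experiments reported in Table 6.

```python
import random

def noisy_copies(signal, n_copies, noise_std, rng):
    """Return n_copies versions of signal, each with independent Gaussian noise."""
    return [[s + rng.gauss(0.0, noise_std) for s in signal]
            for _ in range(n_copies)]

def mse(estimate, signal):
    """Mean squared error between an estimate and the noise-free signal."""
    return sum((e - s) ** 2 for e, s in zip(estimate, signal)) / len(signal)

rng = random.Random(0)
signal = [0.1 * i for i in range(200)]
copies = noisy_copies(signal, n_copies=10, noise_std=1.0, rng=rng)

# Entry-wise average over the ten copies, analogous to averaging ten kernels.
average = [sum(col) / len(col) for col in zip(*copies)]

single_mse = sum(mse(c, signal) for c in copies) / len(copies)
average_mse = mse(average, signal)
print(single_mse, average_mse)
```

With independent noise, the error of the average is expected to be roughly one tenth of the error of a single copy, since the variance of a mean of ten independent terms is one tenth of the individual variance.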

Method               Measure   Best Single Kernel   Sum Kernel
VOC09, k-Means       AP        44.42 ± 12.82        45.84 ± 12.94
VOC09, k-Means       Std       30.81                30.74
VOC09, ERCF          AP        42.60 ± 12.50        47.49 ± 12.89
VOC09, ERCF          Std       38.12                37.89
ImageCLEF, k-Means   AP        31.09 ± 5.56         31.73 ± 5.57
ImageCLEF, k-Means   Std       30.51                30.50
ImageCLEF, ERCF      AP        29.91 ± 5.39         32.77 ± 5.93
ImageCLEF, ERCF      Std       38.58                38.10

Table 6: AP scores and standard deviations showing the amount
of randomness in feature extraction: results from repeated
computations of BoW kernels with randomly initialized codebooks.

To further analyze this, we recall that, in the
computational optimum, the information content of a kernel is
measured by ℓ_p-norm MKL via the following quantity, as
proved in [21]:

\[ \beta \propto \|w\|_2^{2/(p+1)} = \Big( \sum_{i,j} \alpha_i y_i K_{ij} \alpha_j y_j \Big)^{1/(p+1)} \qquad (7) \]

In this paper we deliver a novel interpretation of the above
quantity; to this end, we decompose the right-hand term
into two terms as follows:

\[ \sum_{i,j} \alpha_i y_i K_{ij} \alpha_j y_j = \sum_{i,j \mid y_i = y_j} \alpha_i K_{ij} \alpha_j - \sum_{i,j \mid y_i \neq y_j} \alpha_i K_{ij} \alpha_j \]

The above term can be interpreted as a difference of the
support-vector-weighted sub-kernel restricted to
consistent labels and the support-vector-weighted sub-kernel
over the opposing labels. Equation (7) thus can be rewritten
as

\[ \beta \propto \Big( \sum_{i,j \mid y_i = y_j} \alpha_i K_{ij} \alpha_j - \sum_{i,j \mid y_i \neq y_j} \alpha_i K_{ij} \alpha_j \Big)^{1/(p+1)} \qquad (8) \]

Thus, we observe that random influences in the features
combined with overfitting support vectors can suggest a
falsely high information content in this measure for some
kernels. SVMs do overfit on BoW features: using the
scores attained on the training data subset, we can
observe that many classes are deceptively perfectly predicted,
with AP scores well above 0.9. At this point, non-sparse
ℓ_p-norm MKL with p > 1 offers a parameter p for
regularizing the kernel weights, thus hardening the algorithm
against random noise while still permitting it to use some
degree of the information given by Equation (8).
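
The split of the support-vector-weighted kernel sum into a consistent-label and an opposing-label part can be checked numerically. The sketch below assumes labels in {-1, +1}, so that y_i y_j equals +1 for equal labels and -1 otherwise; the coefficients, the kernel matrix, and the function name are made-up values for illustration and not results from our experiments.

```python
def information_terms(alpha, y, K):
    """Return (full, same, opposite) sums of the support-vector-weighted kernel."""
    n = len(y)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    full = sum(alpha[i] * y[i] * K[i][j] * alpha[j] * y[j] for i, j in pairs)
    same = sum(alpha[i] * K[i][j] * alpha[j] for i, j in pairs if y[i] == y[j])
    opposite = sum(alpha[i] * K[i][j] * alpha[j] for i, j in pairs if y[i] != y[j])
    return full, same, opposite

# Illustrative dual coefficients, labels, and a small kernel matrix.
alpha = [0.5, 0.2, 0.7]
y = [1, 1, -1]
K = [[1.0, 0.5, 0.1],
     [0.5, 1.0, 0.2],
     [0.1, 0.2, 1.0]]

full, same, opposite = information_terms(alpha, y, K)
print(full, same - opposite)
```

A kernel whose entries are large for equally labeled support vectors and small for oppositely labeled ones thus receives a large value; noise that happens to inflate the first sum on the training data inflates the measure as well.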



In accordance with our idea about overfitting of SVMs,
[30] reported that ℓ_2-MKL and ℓ_1-MKL show no gain
in such a scenario, while ℓ_1-MKL even reduces
performance for some data sets. This result is not surprising,
as the overly sparse ℓ_1-MKL has a stronger tendency to
overfit to the randomness contained in the kernels /
feature generation. The observed amount of randomness in
the state-of-the-art BoW features could be an explanation
of why the sum-kernel SVM has been shown to be a quite
hard-to-beat competitor for semantic concept classification
and ranking problems.

4.2 MKL and Prior Knowledge 

For solving a learning problem, there is nothing more
valuable than prior knowledge. Our empirical findings
on the VOC2009 and ImageCLEF2010 data sets suggested
that our experimental setup was actually biased towards
the sum-kernel SVM via usage of prior knowledge when
choosing the set of kernels / image features. We deployed
kernels based on four feature types: BoW-S, BoW-C,
HoC and HoG. However, the number of kernels taken
from each feature type is not equal. Based on our
experience with the VOC and ImageCLEF challenges we used
a higher fraction of BoW kernels and fewer kernels of other
types such as histograms of colors or gradients because
we already knew that BoW kernels have superior
performance.

To investigate to what extent our choice of kernels
introduces a bias towards the sum-kernel SVM, we also
performed another experiment, where we deployed a higher
fraction of weaker kernels for VOC2009. The difference
to our previous experiments lies in that we summarized
the 15 BoW-S kernels in 5 product kernels, reducing the
number of kernels from 32 to 22. The results are given
in Table 7; when compared to the results of the original
32-kernel experiment (shown in Table 1), we observe
that the AP scores are on average about 4 points smaller.
This can be attributed to the fraction of weak kernels
being higher than in the original experiment; consequently,
the gain from using (ℓ_1.333-norm) MKL compared to the
sum-kernel SVM is now more pronounced: over 2 AP points.
Again, this can be explained by the higher fraction of weak
(i.e., noisy) kernels in the working set (this effect is also
confirmed in the toy experiment carried out in the
supplemental material: there, we see that MKL becomes more
beneficial when the number of noisy kernels is increased).

Class / ℓ_p-norm   1.333                   ∞
Aeroplane          77.82 ± 7.701           [illegible]
Bicycle            50.75 ± 11.06           46.39 ± 17.37
Bird               57.7 ± 8.451            [illegible]
Boat               [illegible]             60.9 ± 14.01
Bottle             26.14 ± 9.274           25.05 ± 9.213
Bus                [illegible]             67.74 ± [illegible]
Car                51.72 ± [illegible]     [illegible]
Cat                56.69 ± 9.103           55.55 ± 9.317
Chair              51.67 ± 12.24           49.85 ± 12
Cow                25.33 ± 13.8            22.22 ± 12.41
Diningtable        45.91 ± 19.63           42.96 ± 20.17
Dog                41.22 ± 10.14           39.04 ± 9.565
Horse              52.45 ± 13.41           50.01 ± 13.88
Motorbike          54.37 ± 12.91           52.63 ± 12.66
Person             80.12 ± 10.13           79.17 ± 10.51
Pottedplant        35.69 ± 13.37           34.6 ± 14.09
Sheep              37.05 ± 18.04           34.65 ± 18.68
Sofa               41.15 ± 11.21           37.88 ± 11.11
Train              70.03 ± 15.67           67.87 ± 16.37
Tvmonitor          59.88 ± 10.66           57.77 ± 10.91
Average            52.33 ± 12.57           50.23 ± 12.79

Table 7: MKL versus prior knowledge: AP scores with a
smaller fraction of well scoring kernels.

In summary, this experiment should remind us that se- 
mantic classification setups use a substantial amount of 
prior knowledge. Prior knowledge implies a preselection 
of highly effective kernels — a carefully chosen set of 
strong kernels constitutes a bias towards the sum kernel. 
Clearly, pre-selection of strong kernels reduces the need 
for learning kernel weights; however, in settings where 
prior knowledge is sparse, statistical (or even adaptive, 
adversarial) noise is inherently contained in the feature 
extraction — thus, beneficial effects of MKL are expected 
to be more pronounced in such a scenario. 

4.3 One Argument for Learning the Multiple Kernel Weights:
Varying Informative Subsets of Data

In the previous sections, we presented evidence for why 
the sum-kernel SVM is considered to be a strong learner 
in visual image categorization. Nevertheless, in our ex- 
periments we observed gains in accuracy by using MKL 



for many concepts. In this section, we investigate causes 
for this performance gain. 

Intuitively speaking, one can claim that the kernels
non-uniformly contain varying amounts of information
content. We investigate more specifically what information
content this is and why it differs over the kernels. Our
main hypothesis is that common kernels in visual concept
classification are informative with respect to varying
subsets of the data. This stems from features being frequently
computed from many combinations of color channels. We
can imagine that blue color present in the upper third of an
image can be crucial for the prediction of photos having
clear sky, while other photos showing a sundown or a smoggy
sky tend to contain white or yellow colors; this means
that a particular kernel / feature group can be crucial for
some images, while it may be almost useless, or even
counterproductive, for others.

However, the information content is accessed by MKL
via the quantity given by Eq. (8); the latter is a global
information measure, which is computed over the support
vectors (which in turn are chosen over the whole data set).
In other words, the kernel weights are global weights that
uniformly hold in all regions of the input space. Explicitly
finding informative subsets of the input space on real data
may not only imply too high a computational burden (note
that the number of partitions of an n-element training set
grows exponentially in n) but is also very likely to lead to
overfitting.
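
The growth of the number of partitions can be made concrete with a short computation. The number of partitions of an n-element set is the Bell number B_n; the sketch below computes the first Bell numbers with the Bell triangle. It is included as an illustration of the combinatorial argument only, and the function name is hypothetical.

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max], the numbers of set partitions, via the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells

print(bell_numbers(10))
```

Already for n = 10 there are 115975 partitions, so an exhaustive search over partitions of a training set with thousands of examples is out of reach.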

To understand the implications of the above for
computer vision, we performed the following toy
experiment. We generated a fraction of p_+ = 0.25 of
positively labeled and p_- = 0.75 of negatively labeled
6m-dimensional training examples (motivated by the
unbalancedness of training sets usually encountered in
computer vision) in the following way: the features were
divided into m feature groups, each consisting of six features.
For each feature group, we split the training set into an
informative and an uninformative set (the size varies
over the feature groups); thereby, the informative sets of
the particular feature groups are disjoint. Subsequently,
each feature group is processed by a Gaussian kernel,
where the width is determined heuristically in the same
way as in the real experiments shown earlier in this paper.

Thereby, we consider two experimental setups for
sampling the data, which differ in the number of employed
kernels m and the sizes of the informative sets. In both
cases, the informative features are drawn from two
sufficiently distant normal distributions (one for each class)
while the uninformative features are just Gaussian noise
(mixture of Gaussians). The experimental setup of the
first experiment can be summarized as follows:

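Experimental Settings for Experiment 1 (3 kernels):

\[ n_{k=1,2,3} = (300, 300, 500), \qquad p_+ := P(y = +1) = 0.25 \qquad (9) \]

The features for the informative subset are drawn
according to

\[ f_i^{(k)} \sim \begin{cases} N(0.0, \sigma_k) & \text{if } y_i = -1 \\ N(0.4, \sigma_k) & \text{if } y_i = +1 \end{cases} \qquad (10) \]

\[ \sigma_k = \begin{cases} 0.3 & \text{if } k = 1, 2 \\ 0.4 & \text{if } k = 3 \end{cases} \qquad (11) \]

The features for the uninformative subset are drawn
according to

\[ f^{(k)} \sim (1 - p_+)\, N(0.0, 0.5) + p_+\, N(0.4, 0.5). \qquad (12) \]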

For Experiment 1 the three kernels had disjoint
informative subsets of sizes n_{k=1,2,3} = (300, 300, 500). We used
1100 data points for training and the same amount for
testing. We repeated this experiment 500 times with different
random draws of the data.

Note that the features used for the uninformative
subsets are drawn as a mixture of the Gaussians used for the
informative subset, but with a higher variance.
The increased variance encodes the assumption that the
feature extraction produces unreliable results on the
uninformative data subset. None of these kernels is pure
noise or irrelevant. Each kernel is the best one for its own
informative subset of data points.
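
The sampling scheme of Experiment 1 can be sketched as follows. The sketch makes several assumptions that the description above leaves open: labels are drawn independently with P(y = +1) = 0.25, the informative subsets are consecutive index blocks, the second parameter of each normal distribution is treated as a standard deviation, and the mixture component for an uninformative example is drawn once per example and feature group. The function and parameter names are hypothetical.

```python
import random

def sample_toy_data(sizes=(300, 300, 500), means=(0.4, 0.4, 0.4),
                    sigmas=(0.3, 0.3, 0.4), p_pos=0.25, dim=6,
                    noise_sigma=0.5, seed=0):
    """Sample labels and one block of `dim` features per feature group."""
    rng = random.Random(seed)
    n = sum(sizes)
    labels = [1 if rng.random() < p_pos else -1 for _ in range(n)]
    # owner[i] is the feature group for which example i is informative.
    owner = [k for k, size in enumerate(sizes) for _ in range(size)]
    groups = []
    for k in range(len(sizes)):
        rows = []
        for i in range(n):
            if owner[i] == k:
                # Informative: class-conditional Gaussian, cf. Eq. (10).
                mu = means[k] if labels[i] == 1 else 0.0
                sigma = sigmas[k]
            else:
                # Uninformative: label-independent mixture, cf. Eq. (12).
                mu = means[k] if rng.random() < p_pos else 0.0
                sigma = noise_sigma
            rows.append([rng.gauss(mu, sigma) for _ in range(dim)])
        groups.append(rows)
    return labels, owner, groups

labels, owner, groups = sample_toy_data()
print(len(labels), len(groups), len(groups[0][0]))
```

Each entry of `groups` would then be passed to its own Gaussian kernel. Calling the function with five sizes, means (0.4, 0.4, 0.4, 0.2, 0.2), and sigmas (0.3, 0.3, 0.4, 0.4, 0.4) gives the corresponding sketch for the second setup.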

We now turn to the experimental setup of the second 
experiment: 

Experimental Settings for Experiment 2 (5 kernels):

\[ n_{k=1,2,3,4,5} = (300, 300, 500, 200, 500), \qquad p_+ := P(y = +1) = 0.25 \]

The features for the informative subset are drawn
according to

\[ f_i^{(k)} \sim \begin{cases} N(0.0, \sigma_k) & \text{if } y_i = -1 \\ N(m_k, \sigma_k) & \text{if } y_i = +1 \end{cases} \qquad (13) \]

\[ m_k = \begin{cases} 0.4 & \text{if } k = 1, 2, 3 \\ 0.2 & \text{if } k = 4, 5 \end{cases} \qquad (14) \]

\[ \sigma_k = \begin{cases} 0.3 & \text{if } k = 1, 2 \\ 0.4 & \text{if } k = 3, 4, 5 \end{cases} \qquad (15) \]

The features for the uninformative subset are drawn
according to

\[ f^{(k)} \sim (1 - p_+)\, N(0.0, 0.5) + p_+\, N(m_k, 0.5) \qquad (16) \]

Experiment   Sum-kernel SVM   ℓ_1.0625-MKL    t-test p-value
1            68.72 ± 3.27     69.49 ± 3.17    0.000266
2            55.07 ± 2.86     56.39 ± 2.84    4.7 · 10^-6

Table 8: Varying informative subsets of data: AP scores in the
toy experiment using kernels with disjoint informative subsets
of data.


As for the real experiments, we normalized the
kernels to have standard deviation 1 in Hilbert space and
optimized the regularization constant by grid search in
C ∈ {10^i | i = -2, -1.5, ..., 2}.
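
The normalization and the grid over C can be sketched as follows. The sketch assumes that "standard deviation 1 in Hilbert space" refers to the usual multiplicative normalization, which divides the kernel by the feature-space variance (1/n) Σ_i K_ii - (1/n²) Σ_{i,j} K_ij; the function name is hypothetical.

```python
def normalize_kernel(K):
    """Scale K so that the data have unit variance in the kernel feature space."""
    n = len(K)
    trace_mean = sum(K[i][i] for i in range(n)) / n
    total_mean = sum(sum(row) for row in K) / (n * n)
    variance = trace_mean - total_mean
    return [[v / variance for v in row] for row in K]

# Grid of regularization constants C = 10^i for i = -2, -1.5, ..., 2.
C_GRID = [10.0 ** (i / 2.0) for i in range(-4, 5)]
print(C_GRID)
```

Normalizing all kernels in this way puts them on a common scale, so that a single grid over C is meaningful for every kernel combination.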

Table 8 shows the results. The null hypothesis of equal 
means is rejected by a t-test with a p-value of 0.000266 
and 0.0000047, respectively, for Experiment 1 and 2, 
which is highly significant. 

The design of Experiment 1 is no exceptional lucky
case: we observed similar results when using more
kernels; the performance gaps then even increased.
Experiment 2 is a more complex version of Experiment 1, using
five kernels instead of just three. Again, the
informative subsets are disjoint, but this time of sizes 300, 300,
500, 200, and 500; the Gaussians are centered at 0.4,
0.4, 0.4, 0.2, and 0.2, respectively, for the positive class;
and the variance is taken as σ_k = (0.3, 0.3, 0.4, 0.4, 0.4).
Compared to Experiment 1, this results in even bigger
performance gaps between the sum-kernel SVM and the
non-sparse ℓ_1.0625-MKL. One can imagine creating learning
scenarios with more and more kernels in the above way,
thus increasing the performance gaps; since we aim at a
relative comparison, this, however, would not further
contribute to validating or rejecting our hypothesis.

Furthermore, we also investigated the single-kernel
performance of each kernel: we observed the best
single-kernel SVM (the single kernels attained AP scores of
43.60, 43.40, and 58.90 for Experiment 1) to be inferior to
both MKL (regardless of the employed norm parameter p)
and the sum-kernel SVM. The differences were significant
with fairly small p-values (for example, for ℓ_1.25-MKL the
p-value was about 0.02).

We emphasize that we did not design the example in
order to achieve a maximal performance gap between the
non-sparse MKL and its competitors. For such an
example, see the toy experiment of [21], which is replicated in
the supplemental material including additional analysis.
Our focus here was to confirm our hypothesis that
kernels in semantic concept classification are based on
varying subsets of the data. Although MKL computes global
weights, it emphasizes kernels that are relevant on the
largest informative set and thus approximates the
infeasible combinatorial problem of computing an optimal
partition/grid of the space into regions which underlie identical
optimal weights. In practice, however, we expect the
situation to be more complicated, as informative subsets may
overlap between kernels.

Nevertheless, our hypothesis also opens the way to
new directions for learning of kernel weights, namely
restricted to subsets of data chosen according to a
meaningful principle. Finding such principles is one of the
future goals of MKL; we sketched one possibility: locality in
feature space. A first starting point may be the work of
[51, 52] on localized MKL.

5 Conclusions 

When measuring data with different measuring devices, it
is always a challenge to combine the respective devices'
uncertainties in order to fuse all available sensor
information optimally. In this paper, we revisited this important
topic and discussed machine learning approaches to
adaptively combine different image descriptors in a systematic
and theoretically well-founded manner. While MKL
approaches in principle solve this problem, it has been
observed that the standard ℓ_1-norm based MKL often cannot
outperform SVMs that use an average of a large number
of kernels. One hypothesis why this seemingly unintuitive
result may occur is that the sparsity prior may not be
appropriate in many real-world problems, especially when
prior knowledge is already at hand. We tested whether this
hypothesis holds true for computer vision and applied the
recently developed non-sparse ℓ_p-MKL algorithms to
object classification tasks. The ℓ_p-norm constitutes a slightly
less severe method of sparsification. By choosing p as a
hyperparameter, which controls the degree of non-sparsity
and regularization, from a set of candidate values with the
help of validation data, we showed that ℓ_p-MKL
significantly improves over SVMs with averaged kernels and the
standard sparse ℓ_1-MKL.

Future work will study localized MKL and methods to
include hierarchically structured information into MKL,
e.g. knowledge from taxonomies, semantic information
or spatial priors. Another interesting direction is
MKL-KDA [27, 28]. The difference to the method studied in the
present paper lies in the base optimization criterion: KDA
[53] leads to non-sparse solutions in α while ours leads to
sparse ones (i.e., a low number of support vectors). While
on the computational side the latter is expected to be
advantageous, the former might lead to more accurate
solutions. We expect the regularization over kernel weights
(i.e., the choice of the norm parameter p) to have similar
effects for MKL-KDA as for MKL-SVM. Future studies
will expand on that topic.
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